Skip to main content

Preprocessing

There are several steps followed for preprocessing the data.

  1. First, the dataset is completed. If the dataframe extracted from MIO is incomplete, i.e., contains only rows with non-zero sales and no rows with 0 sales, thus having a lot of missing dates, the missing rows are added to the dataframe for each brand.

    All dates from a certain starting point (usually 2018-01-01) to the end of desired forecasting period (usually 2025-02-01) are added.

    import pandas as pd

    all_dates = pd.date_range(start="2018-01-01", end="2025-02-01", freq="D")
    all_brand_names = df["brand"].unique()

    all_combinations = pd.DataFrame(
    [(date, brand)
    for date in all_dates
    for brand in all_brand_names],
    columns=["date", "brand"]
    )

    df = pd.merge(all_combinations, df, on=["date", "brand"], how='left')
    df.fillna(0, inplace=True)
    df = df[df["brand"] != 0]

    Missing values (for quantity delivered) are filled with 0 and if any brand is originally NaN and has been filled with 0 too, it is removed.

  2. Next, the data is aggregated to monthly.

    df["date"] = df["date"].dt.to_period("M").dt.to_timestamp()
    df = df.groupby(["brand", "date"])["quantity_delivered"].sum().reset_index()

    product_values = df[(df["date"] >= "2023-01-01") & (df["date"] <= "2023-12-01")].groupby("brand")["quantity_delivered"].sum()
    discontinued_products = product_values.loc[product_values.values < 1]
    df = df[~df["brand"].isin(discontinued_products.index)]
    df.reset_index(drop=True, inplace=True)

    Furthermore, brands which have been discontinued in the region (no sales in 2023) are removed.

  3. Weather data is merged with the dataframe optionally.

    weather_df["date"] = pd.to_datetime(weather_df["date"])
    weather_df["date"] = weather_df["date"].dt.to_period("M").dt.to_timestamp()
    weather_df = weather_df.groupby(["date"]).mean().reset_index()

    weather_df_date_max = weather_df["date"].max()
    previous_year_start = (weather_df_date_max - relativedelta(years=1)).strftime('%Y-%m-%d')
    previous_year_end = (datetime.strptime("2025-02-01", '%Y-%m-%d') - relativedelta(years=1)).strftime('%Y-%m-%d')

    weather_df_features = list(weather_df.columns)
    weather_df_features.remove("date")

    df = pd.merge(df, weather_df, on="date", how='left')

    while df[df["date"] > weather_df_date_max][weather_df_features].isna().sum().sum() > 0:
    df.loc[df["date"] > weather_df_date_max, weather_df_features] = df.loc[(df["date"] > previous_year_start) & (df["date"] <= previous_year_end), weather_df_features].values

    df.fillna(0, inplace=True)

    The weather data for future time periods is filled with weather data of the previous year.